Abstract

One important issue commonly encountered in the analysis of microarray data is to
decide which and how many genes should be selected for further studies. For discriminant
microarray data analyses based on statistical models, such as logistic regression
models, gene selection can be accomplished by comparing the maximum likelihood
of the model given the real data, $L(D|M)$, with the expected maximum likelihood of the
model given an ensemble of surrogate data with randomly permuted labels, $L(D_0|M)$.
Typically, the computational burden of obtaining $L(D_0|M)$ is immense, often exceeding the
limits of available computing resources by orders of magnitude. Here, we propose an
approach that circumvents such heavy computations by mapping the simulation problem to
an extreme-value problem. We present the derivation of an asymptotic distribution of
the extreme value as well as its mean, median, and variance. Using this distribution, we
propose two gene selection criteria, and we apply them to two microarray datasets and
three classification tasks for illustration.

1 Introduction

Discriminant microarray data analysis can be understood as a comparison of the
expression levels of samples from one group versus another group, such as disease tissues
versus normal tissues, or one subtype of cancer versus another subtype (for a review, see
[13]). Discriminant analysis or classification can be carried out on a whole set of genes or
on individual genes, and it has become increasingly clear that, for many classification tasks
based on microarray data, it is not necessary to consider many genes simultaneously. In
many cases it has been shown that a few genes are sufficient for classifying two groups of
samples [...]. Usually, even when only a very small number of genes is included in a
classification, these genes are used jointly in a multivariate fashion. However, in some
cases, one or two genes are sufficient for a good classification [...]. This observation led
to procedures that examine one gene at a time, rank the genes according to their
classification ability, and select only the high-ranking genes for further studies, including
new confirmation experiments [...]. Some information could be lost by not considering genes
jointly, but focusing on single genes often simplifies the biological interpretation of the
results.

Two single-gene classification methods that are often applied to the analysis of microarray
data are the fold-change method [...] and the t-test [...]. As repeatedly pointed out in
Refs. [...], the fold-change method is not rigorous from a statistical point
of view, because it considers neither the variances nor the sample sizes of the data. For 
example, a two-fold increase obtained from narrowly distributed data with 1000 samples 
is statistically more significant than the same increase obtained from broadly distributed 
data with 10 samples. The t-test overcomes this shortcoming by including the variance 
and sample size information. However, the t-distribution is obtained by assuming that the 
random variables are sampled from a normal (Gaussian) distribution. 

There are alternative discriminant methods that do not rely on the assumption that the
random variables are normally distributed. Of the four linear classification methods, namely
Fisher's linear discriminant analysis, logistic regression (LR), Rosenblatt's perceptron, and
the support vector machine (SVM), LR and SVM do not rely on this assumption [22], and
hence they are more robust when the actual data are not normally distributed, for example
because of outliers. Another advantage of LR over t-tests is that t-tests compare only
two group averages, whereas LR checks each individual sample for consistent differential
expression. In the following we focus on LR, which has already been used in discriminant
microarray data analyses [...].

Cross-validation is often used for assessing how accurately a dataset can be classified 
by a learned model. In cross-validation, a dataset is divided into two parts, where the first 
part is used for estimating the model parameters, and the second part is used for assessing 
the classification performance. Due to the splitting of the dataset, not all samples are 
included in the learning process, which is not optimal for datasets with a small number of 
samples. On the other hand, if all data points are used in the training process, the error 
rate of the classifier would be underestimated. 

In order to estimate the statistical significance of a learned model, one usually uses 
resampling methods, such as the bootstrap method (resampling with replacement) or the 
permutation method (resampling without replacement). Since only the single-gene LR is
used in this paper, a significant model implies a significant gene. (This correspondence
does not hold for multivariate classifiers due to the possible correlation among genes.)
In Ref. [...], likelihoods of single-gene LRs of real datasets are compared to those of the
label-permuted datasets, and genes with a likelihood exceeding the likelihood of the
top-ranking gene of the permuted data are selected. One problem with actually carrying out
permutations as in Ref. [...] is that the calculation of the LR likelihoods for tens of
thousands of genes is computationally intensive, and that repeating this calculation for,
say, 10^ surrogate datasets is prohibitive.

Here, we propose an analytic solution that circumvents these heavy computations. Our 
approach is based on the observation that we are only interested in the extreme-values 
in the following sense: in order to define a threshold for gene selection, we compare the 
maximum likelihood of each gene in the real data with the maximum likelihood of the 
top-ranking gene in the label-permuted data. Whereas simulation requires the calculation 
of all single-gene likelihoods in the surrogate data for each permutation, the proposed 
analytic calculation of the expected value of the likelihood of the top-ranking gene needs
to be carried out only once.

Extreme-value theory is a well-studied topic in statistics [...], with major
contributions by Ronald A. Fisher, Maurice Fréchet, Emil Gumbel, Vilfredo Pareto, and Waloddi
Weibull, to name just a few. One fundamental assumption often used in deriving an
extreme-value distribution is that the observations are independent. In our application of
the extreme-value distribution, the corresponding assumption is that the log-likelihood
scores of different genes are statistically independent. Clearly, this assumption is violated
in most expression datasets, but as we explain in the Discussion section, there is a simple
remedy: replacing the number of genes $p$ by the "effective number of genes" $p_{\mathrm{eff}}$.

The topic studied in this paper is closely related to the multiple testing problem. A
criterion for claiming statistical significance should be more stringent when many genes
are tested than when only one gene is tested, because multiple tests provide
more chances to find a significant gene. Traditionally, the Bonferroni correction, which
divides the threshold for significance obtained from a single gene by the total number of
tests (genes), is used in those cases. Applying the extreme-value distribution achieves a
similar goal because the largest value among $p$ variables increases with $p$, and this
effectively raises the stringency of a gene selection criterion.

2 Methods 

2.1 Logistic regression of microarray data 

First, we introduce the following notation. Let the samples be indexed by $i$, and let
the genes be indexed by $j$. Denote the total number of samples by $N$, the total number of
genes by $p$, the expression level by $x$, e.g., $x_{ij}$ = log(spot intensity of gene $j$ in
sample $i$), and the sample label value by $y$, e.g., $y = 0$ or $y = 1$ for a binary
classification problem. Then, the single-gene LR model $M_j$ of gene $j$ is defined by the
conditional probabilities of the sample label $y_i$ given the expression levels $x_{ij}$,

$$\Pr(y_i = 1 \mid x_{ij}) = \frac{1}{1 + e^{-a_j - b_j x_{ij}}} \tag{1}$$

for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, p$. Here, $a_j$ and $b_j$ are parameters to be
estimated from all samples $i = 1, 2, \ldots, N$. The data-fitting performance of $M_j$ is
measured by the maximum likelihood,

$$L(D|M_j) = \max_{a_j, b_j} \prod_{i=1}^{N} \left[\Pr(y_i = 1 \mid x_{ij})\right]^{y_i} \left[1 - \Pr(y_i = 1 \mid x_{ij})\right]^{1 - y_i}, \tag{2}$$

where $D$ denotes the data. Since a gene is represented by an LR model, selection of genes
becomes selection of single-gene LR models with large maximum likelihoods. In a more
general context, such as multivariate models, model selection is not equivalent
to variable (gene) selection, but for single-gene models, gene selection and model selection
are treated as the same.

2.2 Maximum likelihood for the surrogate data 

There are different ways of constructing surrogate datasets. For example, one may sample
the expression levels $x_{ij}$ from a normal distribution, and then assign a label $y_i$ to
each sample randomly; or one may start with the available microarray dataset and randomly
permute the sample labels. If a gene in the microarray data is not differentially expressed
before a permutation, the two ways of generating the surrogate data are equivalent. However,
as pointed out in [...], if a gene is indeed differentially expressed before a permutation,
extra variance remains after permutation, and the two methods for generating the surrogate
data can be slightly different.

This subtle difference between the two surrogate datasets may affect a t-test result,
because the t-test makes certain assumptions about the distribution and variance of the data
[...]. The extra variance remaining in the permuted data violates these assumptions. No such
assumption is required for LR, however. For this reason, we do not make this distinction,
and denote by $D_0$ a surrogate dataset with permuted sample labels, whether the original
dataset before permutation contains differentially expressed genes or not.

We denote by $L(D_0|M_j)$ the maximum likelihood under the single-gene LR model $M_j$.
For a particular permutation, we define by

$$l = \max_j \left[\log L(D_0|M_j)\right] \tag{3}$$

the maximum value of the maximum log-likelihoods of all genes. Note the two different
maximization steps: the first over the parameter values $a_j$ and $b_j$ for a given gene, and
the second over all genes $j$. When the surrogate dataset $D_0$ is repeatedly generated, the
maximum value $l$ varies from realization to realization, and our goal is to characterize
the distribution of $l$, e.g., by computing its expected value, median, or standard
deviation.

Toward the calculation of the expected value of $l$, we use the Wilks theorem [...], which
is "one of the most celebrated folklores in statistics" [...] and is covered by most standard
textbooks on mathematical statistics [...]. This theorem states that, under
very general conditions (which our LR model satisfies), the asymptotic distribution of
twice the log-likelihood ratio, when the data is generated by the null model $M_0$, is the
$\chi^2$ distribution with $df$ degrees of freedom, where $df = d(M_j) - d(M_0)$ is the
difference of the number of parameters in models $M_j$ and $M_0$ [...]. Using our notation,
it states that in the $N \to \infty$ limit,

$$2 \log \frac{L(D_0|M_j)}{L(D_0|M_0)} = t,$$

where $t$ denotes a random variable sampled from a $\chi^2$ distribution with $df$ degrees of
freedom.

We choose the null model $M_0$ to be the same for all genes, i.e., $\Pr(y_i = 1 \mid x_{ij}) = c$
for all $j = 1, 2, \ldots, p$. The maximum likelihood estimate of $c$ is simply the fraction of
samples that are labeled as 1, i.e., $c = N_1/N$. The maximum likelihood under $M_0$ is

$$L(D_0|M_0) = c^{N_1} (1 - c)^{N - N_1}, \tag{4}$$

and its logarithm is

$$\log L(D_0|M_0) = -NH,$$

where $H$ is the entropy

$$H = -\frac{N_1}{N} \log \frac{N_1}{N} - \frac{N - N_1}{N} \log \frac{N - N_1}{N}.$$

Note that $L(D|M_0) = L(D_0|M_0)$, because the fraction $N_1/N$ of samples with sample
label $y = 1$ is the same in $D$ and $D_0$.
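
The identity between the null log-likelihood and the entropy can be checked numerically. The short Python sketch below computes the logarithm of Eq. (4) directly and compares it with $-NH$; the function names and the label counts ($N = 72$, $N_1 = 47$) are illustrative assumptions, not values prescribed by the text.

```python
import math

def label_entropy(n_total, n_ones):
    """Entropy H of the label fractions c = N_1/N and 1 - c (natural logarithm)."""
    c = n_ones / n_total
    h = 0.0
    for frac in (c, 1.0 - c):
        if frac > 0.0:               # the term 0 * log(0) is taken as 0
            h -= frac * math.log(frac)
    return h

def null_log_likelihood(n_total, n_ones):
    """log L(D_0|M_0) computed directly from Eq. (4); assumes 0 < N_1 < N."""
    c = n_ones / n_total
    return n_ones * math.log(c) + (n_total - n_ones) * math.log(1.0 - c)

# Hypothetical label counts: N = 72 samples, N_1 = 47 of them labeled 1.
N, N1 = 72, 47
direct = null_log_likelihood(N, N1)
via_entropy = -N * label_entropy(N, N1)
print(f"direct: {direct:.6f}, -N*H: {via_entropy:.6f}")
```

Because the result depends only on $N$ and $N_1$, it is unchanged by any permutation of the labels, which is why the same value serves the real and the surrogate data.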

Applying the LR model to the surrogate data, we obtain for the best single-gene maximum
log-likelihood (in the large sample limit $N \to \infty$):

$$l = \max_j \left[\log L(D_0|M_j)\right] = \max_j \left[\log L(D_0|M_0) + \frac{t_j}{2}\right] = -NH + \frac{1}{2} \max[t_1, t_2, \ldots, t_p],$$

where $t_1, t_2, \ldots, t_p$ are $p$ random variables sampled from a $\chi^2$ distribution with
$df$ degrees of freedom. In this example, $M_j$ contains two parameters, and $M_0$ contains one
parameter, so $df = 2 - 1 = 1$.

2.3 Extreme-value distribution of $\chi^2$-distributed random variables

The extreme-value distribution of normally distributed random variables has been
extensively studied (see, e.g., [...]). Gumbel showed [...] that the extreme-value
distribution of $\chi^2$-distributed variables belongs to the same class as that of normally
distributed variables, which is now called the Gumbel distribution,
$\exp(-\exp(-(x - a)/b))$. For the case of $\chi^2$-distributed variables, the coefficients
$a$ and $b$ are derived in Ref. [22]. Although this extreme-value distribution (of random
variables sampled from the $\chi^2$ distribution with one degree of freedom) is known, for
the sake of completeness we present here a derivation.

Let $t_1, t_2, \ldots, t_p$ be statistically independent and identically distributed (iid)
random values from a $\chi^2$ distribution with one degree of freedom, and define
$T_p = \max[t_1, t_2, \ldots, t_p]$. Based on the inequality [...]:

$$\sqrt{\frac{2}{\pi}} \, \frac{\sqrt{x}}{1 + x} \, e^{-x/2} < \Pr(t_i > x) < \sqrt{\frac{2}{\pi}} \, \frac{1}{\sqrt{x}} \, e^{-x/2} \tag{5}$$

and by defining

$$c_p = \log \frac{p^2}{\pi \log(p)},$$

one finds that for asymptotically large $p$ ($p \to \infty$), the cumulative distribution of
$v_p = (T_p - c_p)/2$ converges to the double exponential function:

$$F_v(x) = \lim_{p \to \infty} F_{v_p}(x) = \lim_{p \to \infty} \Pr\left(\frac{T_p - c_p}{2} < x\right) = \exp(-e^{-x}). \tag{6}$$


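This result can be derived as follows. For any $x$, we obtain

$$\Pr\left(\frac{T_p - c_p}{2} < x\right) = \Pr(T_p < c_p + 2x) = \prod_{i=1}^{p} \left[1 - \Pr(t_i > c_p + 2x)\right] = \left[1 - \Pr(t_i > c_p + 2x)\right]^p,$$

and from inequality (5), one obtains

$$\lim_{p \to \infty} p \Pr(t_i > c_p + 2x) = \lim_{p \to \infty} p \sqrt{\frac{2}{\pi}} \, \frac{e^{-\log(p) + \log\sqrt{\log(p)} + \log\sqrt{\pi} - x}}{\sqrt{c_p + 2x}} = \lim_{p \to \infty} \sqrt{\frac{2 \log(p)}{c_p + 2x}} \, e^{-x} = e^{-x}.$$

Therefore,

$$\lim_{p \to \infty} \Pr\left(\frac{T_p - c_p}{2} < x\right) = \lim_{p \to \infty} \left[1 - \frac{e^{-x}}{p}\right]^p = \exp(-e^{-x}).$$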
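
The key limit in this derivation, $p \Pr(t_i > c_p + 2x) \to e^{-x}$, can be checked numerically. The Python sketch below uses the fact that a $\chi^2$ variable with one degree of freedom is the square of a standard normal variable, so its tail probability equals `erfc(sqrt(t/2))`; the function names and the chosen values of $p$ are assumptions for illustration.

```python
import math

def chi2_1_tail(t):
    """Pr(t_i > t) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))

def c_p(p):
    """Centering constant c_p = log(p^2 / (pi * log(p))), defined for p > 1."""
    return 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)

def scaled_tail(p, x):
    """p * Pr(t_i > c_p + 2x), which should approach exp(-x) as p grows."""
    return p * chi2_1_tail(c_p(p) + 2.0 * x)

for p in (1e3, 1e6, 1e12, 1e100):
    print(f"p = {p:.0e}: p * Pr(t_i > c_p) = {scaled_tail(p, 0.0):.4f} (limit: 1)")
```

The convergence is slow, of logarithmic order in $p$, which is one reason why a direct comparison with simulations at finite $p$ is useful.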

From the asymptotic distribution $F_v(x)$, we can compute the mean $E[v]$, the median
$m[v]$, and the standard deviation $\sigma[v]$:

$$E[v] = \gamma, \qquad m[v] = -\log(\log(2)), \qquad \sigma[v] = \sqrt{\frac{\pi^2}{6}}, \tag{7}$$

where $\gamma \approx 0.5772$ denotes the Euler constant. Hence, we obtain the following asymptotic
scaling for the mean, the median, and the standard deviation of $T_p = c_p + 2v_p$ in the
limit $p \to \infty$:

$$\begin{aligned}
E[T_p] &\approx 2\log(p) - \log(\log(p)) - \log(\pi) + 2\gamma, \\
m[T_p] &\approx 2\log(p) - \log(\log(p)) - \log(\pi) - 2\log(\log(2)), \\
\sigma[T_p] &\approx \sqrt{\frac{2\pi^2}{3}}.
\end{aligned} \tag{9}$$
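
For orientation, the asymptotic expressions in Eq. (9) are easy to evaluate for a concrete number of genes. The Python sketch below does so for $p = 6000$, a value that also appears in the simulations later in the paper; the function name is an assumption for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler constant gamma

def asymptotic_moments(p):
    """Asymptotic mean, median, and standard deviation of T_p (Eq. 9), for p > 1."""
    c = 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)
    mean = c + 2.0 * EULER_GAMMA
    median = c - 2.0 * math.log(math.log(2.0))
    sd = math.sqrt(2.0 * math.pi ** 2 / 3.0)
    return mean, median, sd

mean, median, sd = asymptotic_moments(6000)
print(f"E[T_p] = {mean:.3f}, m[T_p] = {median:.3f}, sigma[T_p] = {sd:.3f}")
```

The mean exceeds the median because the Gumbel distribution is skewed to the right, and the standard deviation does not depend on $p$.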

Based on the extreme-value distribution of $T_p$, we propose the following two gene
selection criteria.

2.4 Gene selection based on the E-value of the extreme-value distribution

In the first criterion, which we call the E-criterion, we compare the maximum likelihood
of each gene obtained from the real data with the expected value of the maximum likelihood
of the top-ranking gene from the surrogate data. This criterion for the likelihood can be
easily converted to a criterion for the log-likelihood ratio: for each gene
$j = 1, 2, \ldots, p$, calculate the log-likelihood ratio

$$t_j = 2 \log \frac{L(D|M_j)}{L(D|M_0)} = 2 \log \frac{L(D|M_j)}{L(D_0|M_0)} = 2 \log L(D|M_j) + 2NH, \tag{10}$$

order them such that $t_{(1)} > t_{(2)} > t_{(3)} > \cdots > t_{(p)}$, and declare the $J$
top-ranking genes as differentially expressed if

$$t_{(J)} > E[T_p] = 2\log(p) - \log(\log(p)) - \log(\pi) + 2\gamma > t_{(J+1)}. \tag{11}$$
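
The E-criterion amounts to a single threshold on the ranked log-likelihood ratios. A minimal Python sketch follows, assuming that the values $t_j$ of Eq. (10) have already been computed for all $p$ genes; the function names and the toy scores are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler constant gamma

def expected_max(p):
    """E[T_p] as used in Eq. (11); requires p > 1."""
    return (2.0 * math.log(p) - math.log(math.log(p))
            - math.log(math.pi) + 2.0 * EULER_GAMMA)

def select_by_e_criterion(t_values):
    """Return (indices of the selected genes in rank order, threshold E[T_p]).

    t_values[j] is the log-likelihood ratio t_j of gene j.
    """
    threshold = expected_max(len(t_values))
    ranked = sorted(range(len(t_values)), key=lambda j: t_values[j], reverse=True)
    selected = [j for j in ranked if t_values[j] > threshold]
    return selected, threshold

# Hypothetical scores for p = 1000 genes: three genes stand out from the rest.
t_values = [1.0] * 997 + [30.0, 12.5, 11.0]
selected, threshold = select_by_e_criterion(t_values)
print(f"threshold E[T_p] = {threshold:.3f}, selected genes: {selected}")
```

Note that the threshold depends on the data only through the number of genes $p$, so it can be computed before any likelihood is evaluated.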

2.5 Gene selection based on the P-value of the extreme-value distribution 

In the second gene selection criterion, which we call the P-criterion, we compute the
P-value of the maximum likelihood of each gene obtained from the real data using
the distribution of the maximum likelihood of the top-ranking gene from the surrogate data.
That is, for each gene $j = 1, 2, \ldots, p$, calculate the log-likelihood ratio
$t_j = 2 \log L(D|M_j) + 2NH$, order them to obtain $t_{(j)}$, and then convert them to
$v_{(j)} = (t_{(j)} - c_p)/2$. We declare the $J$ top-ranking genes as differentially
expressed if and only if an upper limit of the P-value for $t_{(J)}$,
$P_{(J)} < 1 - \exp(-e^{-v_{(J)}})$, is smaller than the user-specified $P_0$, and that of
$t_{(J+1)}$ is larger:

$$1 - \exp(-e^{-v_{(J)}}) < P_0 < 1 - \exp(-e^{-v_{(J+1)}}). \tag{12}$$
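
The P-criterion can be sketched in the same way as a filter on the converted scores. In the Python code below, the function names and the toy scores are hypothetical, and `math.expm1` is used only as a numerically careful way of evaluating $1 - \exp(-e^{-v})$ for very small P-values.

```python
import math

def c_p(p):
    """Centering constant c_p = log(p^2 / (pi * log(p))), defined for p > 1."""
    return 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)

def p_value_upper_limit(t, p):
    """Evaluate 1 - exp(-exp(-v)) with v = (t - c_p) / 2."""
    v = (t - c_p(p)) / 2.0
    return -math.expm1(-math.exp(-v))

def select_by_p_criterion(t_values, p0=0.01):
    """Indices of genes, in rank order, whose P-value upper limit is below p0."""
    p = len(t_values)
    ranked = sorted(range(p), key=lambda j: t_values[j], reverse=True)
    return [j for j in ranked if p_value_upper_limit(t_values[j], p) < p0]

# Hypothetical scores for p = 1000 genes: three genes stand out from the rest.
t_values = [1.0] * 997 + [30.0, 12.5, 11.0]
selected = select_by_p_criterion(t_values, p0=0.01)
print(f"selected genes at P_0 = 0.01: {selected}")
```

With these toy scores, a gene with $t_j = 12.5$ lies above the mean of the extreme-value distribution but not in its upper 1% tail, so it is not selected at $P_0 = 0.01$.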

When a small $P_0$ is chosen, such as $P_0 = 0.01$ or $P_0 = 0.001$, the tail of the
extreme-value distribution is used. In the E-criterion, since the mean of the extreme value
is chosen, we focus on the middle range of the extreme-value distribution. As a result, the
P-criterion is more stringent than the E-criterion, leading to fewer genes selected. This
comes on top of the conservative nature of both the E- and P-criteria, because even the
non-top genes in the real data are compared with the top maximum likelihood in the surrogate
data.

3 Results

3.1 Confirmation of the extreme-value distribution by numerical simulation

We perform numerical simulations to test whether, and to what degree, the asymptotic
expressions of the mean $E[T_p]$, the median $m[T_p]$, and the standard deviation
$\sigma[T_p]$ are acceptable approximations for finite $p$ ranging from 1 to 10^. For each
value of $p$ ranging from 1 to 1.5 × 10^ we generate 10^ samples of $p$ random variables
sampled from a $\chi^2$ distribution with 1 degree of freedom. Fig. 1 shows $E[T_p]$,
$m[T_p]$, and $\sigma[T_p]$ versus $\log(p)$, and we find that the asymptotic expressions of
$E[T_p]$ and $m[T_p]$ agree with the simulation data sufficiently well. The simulations
confirm the trend of a linear increase of $T_p$ with $\log(p)$ as well as the systematic
deviation from this linear trend due to the $\log(\log(p))$ term. The standard deviation
$\sigma[T_p]$ according to Eq. (9) is not a function of $\log(p)$, and indeed, the simulated
values reach a plateau for $p$ > 10^. Note that the predicted standard deviation
$\sqrt{2\pi^2/3}$ is consistently larger than the simulated standard deviation, and the
difference between the two curves becomes smaller as $p$ increases.

Besides the mean, median, and variance, we also compare the distribution of $T_p$ for
finite $p$ with the analytically derived distribution for $p \to \infty$ in order to study to
what degree Eq. (6), derived for the asymptotic limit $p \to \infty$, is an appropriate
approximation for finite $p$ ranging from 10^ to 10^. We generate $p = 6000$ random variables
$t_1, t_2, \ldots, t_p$ sampled from a $\chi^2$ distribution with one degree of freedom, and we
record the maximum value $T_p = \max[t_1, t_2, \ldots, t_p]$. We repeat this sampling process
10^ times, and we compare the empirical P-value, which is the fraction of times that
$v_p = (T_p - c_p)/2$ exceeds a specified value $x$, to the theoretical P-value
$1 - F_v(x) = 1 - \exp(-\exp(-x))$. We find that for $p = 6000$ the two distributions match
well.
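
A scaled-down version of this kind of simulation can be sketched in a few lines of Python. The values of $p$, the number of repetitions, the seed, and the test point $x = 1$ below are assumptions chosen so that the example runs quickly; they are much smaller than the settings described in the text, so the agreement is correspondingly rougher.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler constant gamma

def c_p(p):
    """Centering constant c_p = log(p^2 / (pi * log(p))), defined for p > 1."""
    return 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)

def simulate_max_chi2(p, n_rep, rng):
    """Draw n_rep realizations of T_p, the maximum of p chi-square(1 df) variables."""
    maxima = []
    for _ in range(n_rep):
        # A chi-square(1) variable is the square of a standard normal variable,
        # so the maximum of the squares is the square of the largest |z|.
        z_max = max(abs(rng.gauss(0.0, 1.0)) for _ in range(p))
        maxima.append(z_max * z_max)
    return maxima

rng = random.Random(12345)       # fixed seed, chosen arbitrarily
p, n_rep = 2000, 300             # far smaller than in the text, for speed
T = simulate_max_chi2(p, n_rep, rng)

sim_mean = sum(T) / n_rep
asym_mean = c_p(p) + 2.0 * EULER_GAMMA

x = 1.0
empirical_p = sum((t - c_p(p)) / 2.0 > x for t in T) / n_rep
theoretical_p = 1.0 - math.exp(-math.exp(-x))

print(f"mean of T_p: simulated {sim_mean:.2f}, asymptotic {asym_mean:.2f}")
print(f"Pr(v_p > {x}): empirical {empirical_p:.3f}, theoretical {theoretical_p:.3f}")
```

Increasing `n_rep` reduces the sampling noise, while increasing `p` reduces the remaining gap between the finite-$p$ distribution and its asymptotic form.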

3.2 Gene selection for microarray datasets 

We use two publicly available microarray datasets to illustrate the proposed criteria for 
deciding how many high-ranking genes should be selected: (i) the leukemia subtype data 
from the Whitehead Institute [17], and (ii) the colon cancer data from Princeton University
[...].

ALL versus AML: Fig. 2(a) shows the rank-ordered distribution of the maximum
likelihoods for all single-gene LR models for the discrimination of acute lymphoblastic
leukemia (ALL) from acute myeloid leukemia (AML). The sample size is 72, which combines
both the training and testing sets, as designated in [17]. The ALL-AML classification
problem is thoroughly discussed in [...], and it is well known to be a comparatively easy
classification problem [...].

According to the E-criterion proposed in Eq. (11), 407 genes are selected. In the
converted variable $v_{(j)} = (t_{(j)} - c_p)/2$, the E-criterion is equivalent to
$v_{(j)} > \gamma = 0.5772$. Using the P-criterion proposed in Eq. (12), we obtain that 165
genes are considered to be differentially expressed at the P-value of 0.01 (see Fig. 2(b)).
We note in passing that the number of genes selected by either criterion is substantially
smaller than 1100, which is the number of genes labeled as "more highly correlated with the
AML-ALL class distinction than would be expected by chance," as reported in [17] using the
"neighborhood analysis".

T-cell versus B-cell: As pointed out in [...], the ALL dataset is still a heterogeneous
dataset, with sources from B-cells and T-cells being different from each other. Fig. 3(a)
shows the rank-ordered distribution of the maximum likelihoods using single-gene LR
models for the B-cell versus T-cell classification, with a reduced sample size of 47. The
E-criterion declares 114 genes as differentially expressed, and the more conservative
P-criterion declares 57 genes as differentially expressed at the P-value of 0.01. These
findings are in agreement with the observation in [...] that there are differentially
expressed genes in B-cells and T-cells, and also in agreement with another observation in
[...] based on cluster analysis.

Colon cancer versus normal: Fig. 4(a) shows the rank-ordered distribution of the
maximum likelihoods using single-gene LR models for the colon cancer versus normal tissue
dataset studied in [...]. This dataset consists of 62 samples, and the data for 2000 genes
that have the "highest minimal intensity across the samples" are available from [...]. We
find that only 49 and 10 genes are selected by the E-criterion and the P-criterion (at
$P_0 = 0.01$), respectively. One possible explanation for why these numbers are small is
that the initial number of genes is already restricted to 2000 by some
pre-processing method. Another possible explanation is that the classification task in
this dataset cannot be accomplished by single-gene models.

4 Discussions and conclusions 

The gene selection procedure discussed here circumvents the multiple testing problem
by explicitly including the number of genes ($p$) in the gene selection criterion. It is an
analytic approximation based on known mathematical theorems concerning (i) the
extreme-value distribution of $\chi^2$-distributed random variables, and (ii) the asymptotic
distribution of the log-likelihood ratio. The analytical approximation developed in this
paper is based on the following assumptions: (1) $N \to \infty$, so that the distribution of the
log-likelihood ratio statistic is the $\chi^2$ distribution; (2) $p \to \infty$, so that the
extreme-value distribution can be applied; (3) the extreme value is taken from $p$
independent values. In the context of microarray data analysis, these assumptions translate
to: (1) the number of microarray samples $N$ is very large; (2) the number of genes $p$ is
very large; and (3) the maximum likelihood scores of different genes are statistically
independent.

Based on the simulation result presented in Fig. 1, assumption (2) may not be a serious
problem, since the $\log(p)$ trend, as well as the $\log(\log(p))$ correction, is captured
very well by the analytic formula, even when $p$ is small. Besides, for typical microarray
data, $p$ is large, usually beyond a few thousand. It should be mentioned that
asymptotic results (asymptotic in $p$) are not unique, in the sense that adding any
extra term whose ratio to $c_p$ tends to zero will also give a valid solution. For example,
it can be shown that it is possible to replace $c_p = 2\log(p) - \log(\log(p)) - \log(\pi)$ by
$c_p = 2\log(p) - \log[\log(p) - \log\sqrt{\log(p)} - \log\sqrt{\pi}] - \log(\pi)$. Over a finite
range of $p$, however, the difference between the different formulas can be negligible.
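
To give a sense of how much the two centering constants differ at realistic values of $p$, the following Python sketch evaluates both, using $c_p$ as defined in Section 2.3. The function names and the chosen values of $p$ are assumptions for illustration.

```python
import math

def c_p_basic(p):
    """c_p = 2 log(p) - log(log(p)) - log(pi)."""
    return 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)

def c_p_refined(p):
    """Alternative c_p with log(p) replaced by a corrected term inside the logarithm."""
    lp = math.log(p)
    inner = lp - math.log(math.sqrt(lp)) - math.log(math.sqrt(math.pi))
    return 2.0 * lp - math.log(inner) - math.log(math.pi)

def relative_difference(p):
    """Difference between the two constants relative to the basic c_p."""
    return (c_p_refined(p) - c_p_basic(p)) / c_p_basic(p)

for p in (1e3, 6e3, 1e5):
    print(f"p = {p:.0e}: basic = {c_p_basic(p):.3f}, "
          f"refined = {c_p_refined(p):.3f}, relative diff = {relative_difference(p):.4f}")
```

The relative difference shrinks as $p$ grows, in line with the statement that both forms describe the same asymptotic behavior.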

The violation of assumption (3) can be handled by introducing an "effective number of
genes" $p_{\mathrm{eff}}$. For example, if two genes have identical expression profiles,
they lead to identical maximum-likelihood scores, and the number of genes should be reduced
by one, i.e., $p_{\mathrm{eff}} = p - 1$. In cDNA arrays, several probes may consist of ESTs
originating from the same gene, so these probes will give highly correlated expression
profiles. Since the exact degree of correlation is usually unknown, one must estimate the
total number of redundant probes and subtract it from $p$ to obtain $p_{\mathrm{eff}}$. As
$p_{\mathrm{eff}} < p$ and $c_{p_{\mathrm{eff}}} < c_p$, the effect of gene-gene correlations
is to relax the gene selection criterion, and hence more genes are selected. Interestingly,
a few recent publications show that gene-specific test scores are almost independent [...].
As a result, assumption (3) may not be a serious problem for real data.

When multiple testing is considered in a t-test, the gene selection criterion becomes
more stringent as the number of genes increases. The same holds for the E-criterion
and the P-criterion. Both criteria are conservative in the sense that the $j$-th ranked
gene is compared to the top-ranked classification performer, instead of the $j$-th ranked one,
in the surrogate data. If the E- and P-criteria are compared to each other, we find that
for small values of $P_0$, such as $P_0 = 0.01$ or $P_0 = 0.001$, the P-criterion is more
stringent than the E-criterion. This is because the E-criterion uses the average of the
extreme-value distribution whereas the P-criterion uses the tail area of the distribution.
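
The growth of the selection thresholds with the number of genes can be made concrete with a short Python sketch. It expresses the E-criterion and the P-criterion as thresholds on the log-likelihood ratio $t$, and, purely for comparison, adds a Bonferroni-corrected threshold for a single $\chi^2$ test with one degree of freedom. Applying Bonferroni to the $\chi^2$ P-value of $t_j$ at level 0.05, the bisection solver, and all function names are assumptions made for this illustration and are not part of the procedure described in the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler constant gamma

def chi2_1_tail(t):
    """Pr(t_i > t) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2.0))

def c_p(p):
    """Centering constant c_p = log(p^2 / (pi * log(p))), defined for p > 1."""
    return 2.0 * math.log(p) - math.log(math.log(p)) - math.log(math.pi)

def e_threshold(p):
    """Threshold on t implied by the E-criterion: E[T_p] = c_p + 2 * gamma."""
    return c_p(p) + 2.0 * EULER_GAMMA

def p_threshold(p, p0=0.01):
    """Threshold on t implied by the P-criterion: 1 - exp(-exp(-v)) < p0."""
    v_min = -math.log(-math.log1p(-p0))
    return c_p(p) + 2.0 * v_min

def bonferroni_threshold(p, alpha=0.05):
    """Value of t at which Pr(chi2_1 > t) = alpha / p, found by bisection."""
    target = alpha / p
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_1_tail(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

for p in (1000, 6000, 10000):
    print(f"p = {p}: E = {e_threshold(p):.2f}, P(0.01) = {p_threshold(p):.2f}, "
          f"Bonferroni(0.05) = {bonferroni_threshold(p):.2f}")
```

All three thresholds increase with $p$, and the P-criterion threshold lies above the E-criterion threshold for small $P_0$, which matches the ordering of stringency described in the text.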

The conservative nature of the E- and P-criteria has the side effect that fewer
genes are selected than by some other gene selection criteria. This may be a positive or
negative side effect, depending on the goal of the data analyst. Selecting many genes as
differentially expressed increases the risk of declaring non-differentially expressed genes
as differentially expressed, and selecting only a few genes increases the risk of missing
differentially expressed genes. In the framework of hypothesis testing, one can reduce the
type-I error (the number of false positives) at the cost of increasing the type-II error (the
number of false negatives). An overly stringent gene selection criterion reduces the number
of false positives in the set of selected genes at the cost of missing potentially meaningful
genes. Whether or not a good balance is reached by the E- and P-criteria can only be judged by
future applications of these criteria to real data.

Figure 1: Numerical simulation of the extreme values $T_p = \max[t_1, t_2, \ldots, t_p]$ of $p$
random variables $t_1, t_2, \ldots, t_p$ sampled from the $\chi^2$ distribution with 1 degree
of freedom. The mean $E[T_p]$ (solid dots), the median $m[T_p]$ (triangles), and the standard
deviation $\sigma[T_p]$ (crosses) are plotted against $\log(p)$ for $p$ ranging from 1 to
1.5 × 10^. The analytic results for the mean, median, and standard deviation from Eq. (9),
which are exact for asymptotic $p$, are shown as solid lines. For asymptotically large $p$,
both the mean and the median of $T_p$ increase with $p$ as $\approx 2\log(p) - \log(\log(p))$.
A linear regression line fitting the mean of $T_p$ is displayed as a dashed line:
$E[T_p] \approx -1.14 + 1.89 \log(p)$ (the fitting range of $p$ is from 10^ to 1.5 × 10^).
The horizontal solid line is the standard deviation $\sqrt{2\pi^2/3} \approx 2.56$.

Figure 2: (a) Rank-ordered log-likelihood ratios for the ALL versus AML dataset: $t_{(j)}$
($j = 1, 2, \ldots, p$) defined in Eq. (10). (b) Rank-ordered P-values for the same dataset:
$P_{(j)} = 1 - \exp(-\exp(-v_{(j)}))$, where $v_{(j)} = (t_{(j)} - c_p)/2$. In (a) the
E-criterion declares 407 genes as differentially expressed, and in (b) the more
conservative P-criterion declares 165 genes as differentially expressed.

Figure 3: (a) Rank-ordered log-likelihood ratios for the T-cell versus B-cell dataset. (b)
Rank-ordered P-values for the same dataset. In (a) the E-criterion declares 114 genes as
differentially expressed, and in (b) the more conservative P-criterion declares 57 genes as
differentially expressed.

Figure 4: (a) Rank-ordered log-likelihood ratios for the colon cancer versus normal dataset.
(b) Rank-ordered P-values for the same dataset. In (a) the E-criterion declares 49 genes as
differentially expressed, and in (b) the more conservative P-criterion declares 10 genes as
differentially expressed.



